6 research outputs found
Synthesising prosody with insufficient context
Prosody is a key component in human spoken communication, signalling emotion, attitude, information structure, intention, and other communicative functions through perceived variation in intonation, loudness, timing, and voice quality. However, the prosody in text-to-speech (TTS) systems is often monotonous and adds no additional meaning to the text. Synthesising prosody is difficult for several reasons; I focus on three challenges. First, prosody is embedded in the speech signal, making it hard to model with machine learning. Second, there is no clear orthography for prosody, meaning it is underspecified in the input text and difficult to control directly. Third, and most importantly, prosody is determined by the context of a speech act, which TTS systems do not, and never will, have complete access to. Without this context, we cannot say whether prosody is appropriate or inappropriate. Context is wide-ranging, but state-of-the-art TTS acoustic models only have access to phonetic information and limited structural information. Unfortunately, most context is difficult, expensive, or impossible to collect. Thus, fully specified prosodic context will never exist. Given this insufficient context, prosody synthesis is a one-to-many generative task: it necessitates the ability to produce multiple renditions. To provide this ability, I propose methods for prosody control in TTS, using either explicit prosody features, such as F0 and duration, or learnt prosody representations disentangled from the acoustics. I demonstrate that without control of the prosodic variability in speech, TTS will produce average prosody, i.e. flat and monotonous prosody.
This thesis explores different options for operating these control mechanisms. Random sampling from a learnt distribution of prosody produces more varied and realistic prosody. Alternatively, a human-in-the-loop can operate the control mechanism, using their intuition to choose appropriate prosody. To improve the effectiveness of human-driven control, I design two novel approaches that make control mechanisms more human-interpretable. Finally, it is important to take advantage of additional context as it becomes available. I present a novel framework that can incorporate arbitrary additional context, and demonstrate my state-of-the-art context-aware model of prosody using a pre-trained and fine-tuned language model. This thesis demonstrates empirically that appropriate prosody can be synthesised with insufficient context by accounting for unexplained prosodic variation.
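One of the control mechanisms described above conditions the acoustic model on explicit prosody features such as F0 and duration. A minimal sketch of that idea follows; the function name, the concatenation scheme, and all dimensions are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def condition_on_prosody(phone_embeddings, f0, duration):
    """Illustrative explicit prosody control: per-phone F0 and duration
    values are appended to the phone embeddings before they enter the
    acoustic model, so changing these features changes the rendition."""
    prosody = np.stack([f0, duration], axis=-1)              # (n_phones, 2)
    return np.concatenate([phone_embeddings, prosody], axis=-1)

rng = np.random.default_rng(0)
phones = rng.standard_normal((7, 64))                        # 7 phones, 64-d embeddings
f0 = np.array([110., 115., 130., 125., 120., 105., 95.])     # Hz, per phone
dur = np.array([0.06, 0.09, 0.12, 0.08, 0.07, 0.10, 0.15])   # seconds, per phone
conditioned = condition_on_prosody(phones, f0, dur)          # (7, 66)
```

Raising the F0 values or lengthening the durations before synthesis would, under this scheme, directly steer the generated prosody toward a different rendition of the same text.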
Using generative modelling to produce varied intonation for speech synthesis
Unlike human speakers, typical text-to-speech (TTS) systems are unable to
produce multiple distinct renditions of a given sentence. This has previously
been addressed by adding explicit external control. In contrast, generative
models are able to capture a distribution over multiple renditions and thus
produce varied renditions using sampling. Typical neural TTS models learn the
average of the data because they minimise mean squared error. In the context of
prosody, taking the average produces flatter, more boring speech: an "average
prosody". A generative model that can synthesise multiple prosodies will, by
design, not model average prosody. We use variational autoencoders (VAEs) which
explicitly place the most "average" data close to the mean of the Gaussian
prior. We propose that by moving towards the tails of the prior distribution,
the model will transition towards generating more idiosyncratic, varied
renditions. Focusing here on intonation, we investigate the trade-off between
naturalness and intonation variation and find that typical acoustic models can
either be natural, or varied, but not both. However, sampling from the tails of
the VAE prior produces much more varied intonation than the traditional
approaches, whilst maintaining the same level of naturalness.

Comment: Accepted for the 10th ISCA Speech Synthesis Workshop (SSW10).
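The tail-sampling idea above can be sketched with a toy latent space. This is an illustrative assumption about the mechanism, not the paper's implementation: latent vectors are drawn from the standard Gaussian prior, and "tail" samples are obtained by rejecting draws whose norm is small, i.e. draws near the mean where the most "average" renditions sit. The threshold and dimensions are arbitrary.

```python
import numpy as np

def sample_prior(n, dim, rng):
    """Ordinary draws from a standard Gaussian VAE prior (near-mean mass
    corresponds to 'average prosody')."""
    return rng.standard_normal((n, dim))

def sample_from_tails(n, dim, rng, min_radius=5.0):
    """Rejection-sample latents whose norm exceeds min_radius, i.e. points
    in the tails of the prior, hypothesised to decode to more varied,
    idiosyncratic renditions."""
    samples = []
    while len(samples) < n:
        z = rng.standard_normal(dim)
        if np.linalg.norm(z) > min_radius:
            samples.append(z)
    return np.stack(samples)

rng = np.random.default_rng(0)
z_avg = sample_prior(5, 16, rng)          # near-mean draws
z_varied = sample_from_tails(5, 16, rng)  # tail draws, fed to the decoder
```

In the paper's framing, decoding `z_avg` would yield renditions close to average prosody, while decoding `z_varied` trades toward more varied intonation.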
Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0
In English, prosody adds a broad range of information to segment sequences,
from information structure (e.g. contrast) to stylistic variation (e.g.
expression of emotion). However, when learning to control prosody in
text-to-speech voices, it is not clear what exactly the control is modifying.
Existing research on discrete representation learning for prosody has
demonstrated high naturalness, but no analysis has been performed on what these
representations capture, or if they can generate meaningfully-distinct variants
of an utterance. We present a phrase-level variational autoencoder with a
multi-modal prior, using the mode centres as "intonation codes". Our evaluation
establishes which intonation codes are perceptually distinct, finding that the
intonation codes from our multi-modal latent model were significantly more
distinct than a baseline using k-means clustering. We carry out a follow-up
qualitative study to determine what information the codes are carrying. Most
commonly, listeners commented on the intonation codes having a statement or
question style. However, many other affect-related styles were also reported,
including: emotional, uncertain, surprised, sarcastic, passive aggressive, and
upset.

Comment: Published at the 10th ISCA International Conference on Speech Prosody (SP2020).
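The baseline compared against above derives discrete "intonation codes" by k-means clustering. A self-contained sketch of that baseline follows, on synthetic falling (statement-like) and rising (question-like) F0 contours; the toy data, the tiny k-means loop, and all names are illustrative, not the paper's setup (the proposed model instead takes the mode centres of a multi-modal VAE prior as codes).

```python
import numpy as np

def kmeans_codes(contours, k, iters=50, seed=0):
    """Toy k-means over F0 contours; the cluster centres play the role of
    discrete 'intonation codes' in the baseline."""
    rng = np.random.default_rng(seed)
    centres = contours[rng.choice(len(contours), k, replace=False)]
    for _ in range(iters):
        # assign each contour to its nearest centre
        d = np.linalg.norm(contours[:, None, :] - centres[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each centre as the mean of its assigned contours
        for j in range(k):
            if np.any(labels == j):
                centres[j] = contours[labels == j].mean(axis=0)
    return centres, labels

# synthetic per-phrase F0 contours (Hz over normalised time)
t = np.linspace(0, 1, 20)
rng = np.random.default_rng(1)
falling = 120 - 30 * t + rng.normal(0, 2, (25, 20))  # statement-like
rising = 100 + 40 * t + rng.normal(0, 2, (25, 20))   # question-like
contours = np.vstack([falling, rising])

codes, labels = kmeans_codes(contours, k=2)
```

Each code is an average contour shape; the paper's perceptual evaluation asks whether such codes, versus mode centres of a learnt multi-modal prior, yield perceptually distinct renditions when used to drive synthesis.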